For today’s discussion, we will review the distinction between parametric and nonparametric regression models, as well as some key properties of simple linear regression. If you wish to review materials from POL 211 & 212, you can access them through the following links: POL 211 Discussion, POL 212 Discussion.
When it comes to regression analysis, choosing the right approach is crucial for accurate predictions and meaningful insights. Two common families of methods are parametric, like linear regression, and semi/non-parametric, like smoothing spline regression or kernel regression. Each has its own advantages and disadvantages, and the choice between them largely depends on the nature of the data and the underlying relationships.
Parametric Regression Linear regression is a well-known parametric method that assumes a linear functional form for the relationship between the predictors (\(X\)) and the target variable (\(Y\)). This approach has several benefits, such as ease of estimation with a small number of coefficients. In linear regression, these coefficients have straightforward interpretations, and statistical significance tests are readily applicable. However, parametric methods come with a significant limitation — they rely on the assumption that the specified functional form is a close approximation to the true relationship. If this assumption is far from reality, linear regression can perform poorly and yield unreliable results.
Nonparametric Regression On the other hand, non-parametric methods like K-Nearest Neighbors (KNN) regression do not make explicit assumptions about the functional form of the relationship between \(X\) and \(Y\). Instead, they provide a more flexible approach for regression. KNN regression identifies the K training observations closest to a prediction point and estimates the target variable by averaging their values. While this approach is more versatile and can handle complex relationships, it can suffer from high variance when K is small, leading to overfitting. Conversely, when K is large, KNN regression can underfit the data.
Assume that we have an outcome variable \(Y\) and two explanatory variables, \(x_1\) and \(x_2\). In general, the regression model that describes the relationship can be written as:
\[Y = f_1(x_1) + f_2(x_2) + \epsilon\]
Some parametric regression models:
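For instance, if we are willing to assume specific functional forms for \(f_1\) and \(f_2\), the model becomes parametric (illustrative examples):

\[Y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \epsilon \quad \text{(linear in both predictors)}\]

\[Y = \alpha + \beta_1 x_1 + \beta_2 x_1^2 + \beta_3 x_2 + \epsilon \quad \text{(quadratic in } x_1 \text{, linear in } x_2\text{)}\]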
If we do not know \(f_1\) and \(f_2\) functions, we need to use a Nonparametric regression model.
K-Nearest Neighbors (KNN) regression is one of the simplest and best-known nonparametric methods. Given a value for \(K\) and a prediction point \(x_0\), KNN regression first identifies the \(K\) training observations that are closest to \(x_0\), represented by \(N_0\). It then estimates \(f(x_0)\) using the average of all the training responses in \(N_0\). In other words,
\[\hat{f}(x_0)=\frac{1}{K}\sum_{x_i \in N_0}y_i\]
When using KNN as a regressor with a continuous dependent variable, data points are scattered across the coordinate plane. When a new data point is introduced, its \(K\) neighbors are identified using a distance metric. Usually, the Euclidean distance is used. Once the neighbors are identified, the predicted value of the new data point is calculated as the average of the neighbors’ values.
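To make the mechanics concrete, here is a minimal sketch of KNN regression in base R with a single predictor (the toy data and the function name `knn_predict` are my own, for illustration only):

```r
set.seed(123)
# toy training data: one predictor, one continuous outcome
x <- runif(100, 0, 10)
y <- 2 * x + rnorm(100)

# predict y at a new point x0 by averaging the K nearest neighbors
knn_predict <- function(x0, x, y, K) {
  d  <- abs(x - x0)      # Euclidean distance in one dimension
  N0 <- order(d)[1:K]    # indices of the K closest training observations
  mean(y[N0])            # average their responses
}

knn_predict(5, x, y, K = 9)  # should be close to the true value 2*5 = 10
```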
The below figure illustrates two KNN fits on a data set with \(p = 2\) predictors. The fit with \(K = 1\) is shown in the left-hand panel, while the right-hand panel corresponds to \(K = 9\). We see that when \(K = 1\), the KNN fit perfectly interpolates the training observations, and consequently takes the form of a step function. When \(K = 9\), the KNN fit still is a step function, but averaging over nine observations results in much smaller regions of constant prediction, and consequently a smoother fit. In general, the optimal value for \(K\) will depend on the bias-variance tradeoff, which Chris introduced in POL 212. A small value for \(K\) provides the most flexible fit, which will have low bias but high variance. This variance is due to the fact that the prediction in a given region is entirely dependent on just one observation. In contrast, larger values of \(K\) provide a smoother and less variable fit; the prediction in a region is an average of several points, and so changing one observation has a smaller effect. However, the smoothing may cause bias by masking some of the structure in \(f(X)\).
The bias-variance tradeoff is a fundamental concept in machine learning that deals with the balance between the bias (error from overly simplistic assumptions) and variance (sensitivity to fluctuations in the training data) of a model. In the context of KNN (K-Nearest Neighbors), adjusting the value of K can help control this tradeoff.
Low \(K\) (e.g., \(K=1\)): Low bias, high variance. The model closely follows the training data, leading to high variance and sensitivity to noise. It can capture complex patterns but may overfit the training data.
High \(K\) (e.g., large \(K\)): High bias, low variance. The model averages over more data points, resulting in smoother predictions with lower variance but potentially higher bias. It may underfit the data by oversimplifying the underlying patterns.
Choosing an appropriate value for \(K\) involves finding a balance between bias and variance to achieve optimal model performance. This tradeoff influences the model’s ability to generalize to unseen data.
There are no pre-defined statistical methods to find the most favorable value of \(K\), but a typical approach is: fit the model for a range of candidate \(K\) values, compute the prediction error (e.g., validation MSE) for each, and select the \(K\) that minimizes it.
The key question is when to choose a parametric approach like linear regression over a non-parametric one such as KNN regression. The answer is straightforward: a parametric approach performs better when the chosen functional form is a close match to the true relationship, particularly in the presence of a linear relationship. If the specified functional form is far from the truth, and prediction accuracy is our goal, then the parametric method will perform poorly. For instance, if we assume a linear relationship between \(X\) and \(Y\) but the true relationship is far from linear, then the resulting model will provide a poor fit to the data, and any conclusions drawn from it will be suspect.
In contrast, non-parametric methods do not explicitly assume a parametric form for \(f(X)\), and thereby provide an alternative and more flexible approach for performing regression.
To illustrate this point, let’s consider a few scenarios and use R to simulate them:
Linear Relationship: When the true relationship between \(X\) and \(Y\) is linear, linear regression outperforms nonparametric regression. Linear regression provides an almost perfect fit in this situation, as it closely matches the underlying relationship.
set.seed(1234) # for replication
N <- 1000 # set up the sample size
x1 <- rnorm(N, mean = 10, sd = 3)
e <- rnorm(N, mean = 0, sd = 3) # set the errors to be normally distributed
y <- 7 + 3*x1 + e # set the true y
df <- data.frame(id = 1:N, y = y, x1 = x1, e = e)
In the above chunk, I specify the true \(y\) to be a function of \(x_1\) and \(e\). The true value of the regression coefficient on \(x_1\) is 3, and the intercept is 7. Let’s first plot the true relationship between \(x_1\) and \(y\)!
library(tidyverse)
ggplot(data = df, aes(y = y, x = x1)) +
geom_point(alpha = 0.6) +
geom_smooth() +
theme_bw()
Let’s begin by splitting the generated data into a training set, comprising 80% of the data, and a test set, comprising the remaining 20%.
set.seed(1234) # set the seed to make the partition reproducible
train <- df %>% sample_frac(0.8) # should have 800 obs
test <- anti_join(df, train, by = 'id') # should have 200 obs
Now, let’s first apply the lm() function to our training
set. We’ll then use the coefficients estimated from this model to
predict the y-values in the test set, followed by computing the mean
squared error (MSE).
fit <- lm(y ~ x1, data = train)
OLS

| Variables | Estimates | S.E. | C.I. (95%) |
|---|---|---|---|
| (Intercept) | 6.50 *** | 0.35 | 5.81 – 7.19 |
| x1 | 3.06 *** | 0.03 | 2.99 – 3.13 |
| Observations | 800 | | |
| R2 / R2 adjusted | 0.910 / 0.910 | | |
test$y_hat <- predict(fit, newdata = data.frame(x1 = test$x1))
# compute mean squared error (MSE)
MSE_ols <- sum((test$y - test$y_hat)^2)/nrow(test)
MSE_ols
## [1] 8.770813
The mean squared error (MSE) obtained by using OLS regression to estimate the relationship between \(y\) and \(x_1\) in the training set and then using the estimated coefficients to predict \(y\) in the test set is 8.770813. Now, let’s proceed to use KNN regression with various values of \(K\) to observe how the MSE varies.
We will use the knn.reg() function from the FNN package for KNN regression. This function requires four arguments to be specified:

- train: the predictors of the training data
- test: the predictor values, \(x\), at which we would like to make predictions
- y: the response for the training data
- k: the number of neighbors to consider

# install.packages("FNN")
library(FNN)
train_no_y <- train %>% select(x1)
test_no_y <- test %>% select(x1)
train_y <- train %>% select(y)
yhat_k1 <- knn.reg(train = train_no_y, test = test_no_y, y = train_y, k = 1)
yhat_k1
## Prediction:
## [1] 29.381821 28.304426 65.609505 23.837209 34.330518 33.436274 21.824014
## [8] 30.847990 32.289959 42.276634 48.391432 53.703199 31.595459 33.423440
## [15] 32.935792 41.304789 26.486024 39.445252 52.119807 38.608137 44.000824
## [22] 38.608137 37.159409 38.540173 33.079801 39.907787 44.810978 37.870818
## [29] 34.348167 42.174368 40.401263 34.996959 27.478094 48.391432 33.978417
## [36] 26.367731 56.238177 31.240796 37.950729 55.185947 34.850824 43.470660
## [43] 44.483329 21.700094 49.108239 44.558484 41.288458 40.623165 31.965684
## [50] 28.174148 35.309716 44.393660 40.512042 48.391432 26.800229 31.877765
## [57] 45.551430 28.354259 39.391663 43.100323 33.168206 25.652004 37.857259
## [64] 42.280182 31.905242 30.070676 45.434786 32.597936 55.713238 30.245094
## [71] 35.747199 24.318503 35.309716 42.142535 44.687740 34.463743 41.607736
## [78] 11.887226 36.074342 55.185947 33.106591 64.709783 20.566998 54.842595
## [85] 45.434786 51.245243 39.073328 28.874519 41.177921 35.357025 30.071689
## [92] 36.168858 35.095983 28.392045 31.241205 32.798233 29.709591 37.191013
## [99] 38.608137 26.841422 44.442204 24.630053 26.367731 19.105273 33.290191
## [106] 40.401263 56.238177 21.708475 33.771862 38.654251 42.868435 49.161958
## [113] 33.168206 26.492572 27.573812 41.597145 30.479421 45.411737 40.623165
## [120] 37.208831 35.170920 8.539242 48.085100 44.483329 20.591604 21.824014
## [127] 45.068575 40.911258 35.964535 44.731334 38.432659 44.912912 38.520474
## [134] 33.390623 24.145923 22.517012 37.950729 41.465057 36.782189 31.241205
## [141] 33.423440 29.709591 28.827443 21.824014 39.739236 49.108239 41.337496
## [148] 51.204406 31.801673 31.406837 30.414360 32.690670 39.641522 28.351906
## [155] 44.206919 32.373809 48.402430 31.595459 67.724169 27.404886 36.782189
## [162] 42.020108 54.160505 36.074342 48.649611 34.330518 44.253326 42.549677
## [169] 36.377909 46.279074 27.157538 25.692918 30.150333 29.936407 50.422834
## [176] 27.404886 37.870818 33.978417 29.820616 32.646670 28.464324 26.137602
## [183] 29.774837 38.787417 29.728468 39.795625 54.842595 48.649611 38.169491
## [190] 38.146542 33.233128 36.284027 41.820483 35.838288 38.441584 32.682236
## [197] 28.521211 41.054304 52.592726 44.479609
As observed from the above results, when setting \(K = 1\) in knn.reg(), it
provides the predicted \(y\) values for
the test set. Now, let’s calculate the MSE for this prediction.
MSE_k1 <- sum((test$y - yhat_k1$pred)^2)/nrow(test)
MSE_k1
## [1] 20.25002
As you can see, the MSE when employing KNN regression with \(K = 1\) is significantly higher than the MSE observed with OLS regression. Let’s create a for loop and repeat this process for various values of \(K\) to determine whether the MSE improves or worsens compared to the MSE when using OLS.
MSE_KNN <- c()
for (i in seq(1, 100, by = 2)){
yhat_k <- knn.reg(train = train_no_y, test = test_no_y, y = train_y, k = i)
MSE_k <- sum((test$y - yhat_k$pred)^2)/nrow(test)
MSE_KNN <- c(MSE_KNN, MSE_k)
}
MSE_KNN
## [1] 20.250016 11.554227 10.170821 10.020766 10.002982 9.655185 9.468217
## [8] 9.151752 9.058586 8.825635 8.704344 8.733093 8.672769 8.708741
## [15] 8.678995 8.627246 8.531880 8.596169 8.653730 8.691982 8.783636
## [22] 8.831042 8.858023 8.874769 8.833497 8.853275 8.927840 8.982527
## [29] 8.938121 8.948997 9.004996 9.012363 8.970862 8.969159 9.031482
## [36] 9.058308 9.110250 9.158928 9.255961 9.305864 9.352528 9.415657
## [43] 9.468395 9.539323 9.576483 9.650497 9.693073 9.724640 9.752314
## [50] 9.824734
ggplot(data = data.frame(K = seq(1, 100, by = 2), MSE = MSE_KNN),
aes(y = MSE, x = K)) +
geom_line(alpha = 0.4, color = "darkgreen") +
geom_point(alpha = 0.3, color = "darkgreen") +
geom_hline(yintercept = MSE_ols, color = "darkgray",
alpha = 0.8, linetype = "dashed") +
annotate("text", x = 90, y = 18, label = "KNN", color = "darkgreen", alpha = 0.5) +
annotate("text", x = 90, y = 17, label = "OLS", color = "darkgray") +
theme_bw()
After simulation, it’s evident that when the true relationship between the variables of interest is linear, KNN regression doesn’t offer superior predictions compared to OLS regression.
Slight Non-Linearity: In cases of slight non-linearity, where the true relationship deviates slightly from linearity, KNN regression can perform nearly as well as linear regression. It still provides reasonable results without a substantial reduction in prediction accuracy.
Strong Non-Linearity: However, in situations with a strong non-linear relationship, KNN regression outperforms linear regression. This is because KNN can adapt to complex relationships, providing more accurate predictions.
set.seed(1234) # for replication
N <- 1000 # set up the sample size
x1 <- rnorm(N, mean = 1, sd = 10)
e <- rnorm(N, mean = 3, sd = 10) # set the errors to be normally distributed
y <- 3 + 4*(x1+3)^2*(x1-1)^3 + e # set the true y
df <- data.frame(id = 1:N, y = y, x1 = x1, e = e)
For example, let’s now generate a relationship between \(x_1\) and \(y\) that is strongly non-linear. I specify the true \(y\) to be a function of \(x_1\) and \(e\). Overall, the formula, \(y = 3+4(x_1+3)^2(x_1-1)^3+e\), describes a relationship where \(y\) is influenced by \(x_1\) in a nonlinear way, with two distinct “bumps” or curves, along with some random variability represented by \(e\).
library(tidyverse)
ggplot(data = df, aes(y = y, x = x1)) +
geom_point(alpha = 0.6) +
geom_smooth() +
theme_bw()
Following our previous approach, let’s now compare the MSE obtained from OLS regression, used to estimate the relationship between \(y\) and \(x_1\), with the MSE derived from KNN regression, employing different values of \(K\).
Step 1: Split the Sample into Training and Test Sets
set.seed(1234) # set the seed to make the partition reproducible
train <- df %>% sample_frac(0.8) # should have 800 obs
test <- anti_join(df, train, by = 'id') # should have 200 obs
Step 2: Run OLS Regression and Compute the MSE
fit2 <- lm(y ~ x1, data = train)
OLS

| Variables | Estimates | S.E. | C.I. (95%) |
|---|---|---|---|
| (Intercept) | 95970.43 | 389692.62 | -668973.27 – 860914.13 |
| x1 | 734839.02 *** | 38319.47 | 659620.15 – 810057.88 |
| Observations | 800 | | |
| R2 / R2 adjusted | 0.315 / 0.315 | | |
test$y_hat <- predict(fit2, newdata = data.frame(x1 = test$x1))
MSE_ols <- sum((test$y - test$y_hat)^2)/nrow(test)
MSE_ols
## [1] 4.742095e+13
Step 3: Run KNN and Compute the MSE
train_no_y <- train %>% select(x1)
test_no_y <- test %>% select(x1)
train_y <- train %>% select(y)
MSE_KNN <- c()
for (i in seq(1, 100, by = 2)){
yhat_k <- knn.reg(train = train_no_y, test = test_no_y, y = train_y, k = i)
MSE_k <- sum((test$y - yhat_k$pred)^2)/nrow(test)
MSE_KNN <- c(MSE_KNN, MSE_k)
}
MSE_KNN
## [1] 1.006088e+11 1.791096e+11 1.840990e+11 5.863832e+11 9.814868e+11
## [6] 4.548076e+11 1.081960e+12 1.129925e+12 1.487238e+12 2.374122e+12
## [11] 2.763030e+12 3.688595e+12 4.638701e+12 5.402272e+12 6.302316e+12
## [16] 6.098534e+12 6.924194e+12 7.720989e+12 8.508775e+12 9.279200e+12
## [21] 9.522820e+12 1.021639e+13 1.021119e+13 1.086252e+13 1.149660e+13
## [26] 1.209597e+13 1.238768e+13 1.296902e+13 1.354249e+13 1.352895e+13
## [31] 1.408380e+13 1.458944e+13 1.475359e+13 1.524066e+13 1.571534e+13
## [36] 1.618749e+13 1.664102e+13 1.702627e+13 1.745704e+13 1.775741e+13
## [41] 1.815320e+13 1.818130e+13 1.853354e+13 1.892895e+13 1.930142e+13
## [46] 1.966064e+13 2.001759e+13 2.037685e+13 2.072546e+13 2.106549e+13
Step 4: Compare the MSE Obtained from OLS with That from KNN
ggplot(data = data.frame(K = seq(1, 100, by = 2), MSE = MSE_KNN),
aes(y = MSE, x = K)) +
geom_line(alpha = 0.4, color = "darkgreen") +
geom_point(alpha = 0.3, color = "darkgreen") +
geom_hline(yintercept = MSE_ols, color = "darkgray",
alpha = 0.8, linetype = "dashed") +
annotate("text", x = 90, y = 4e+13, label = "KNN", color = "darkgreen", alpha = 0.5) +
annotate("text", x = 90, y = 3.7e+13, label = "OLS", color = "darkgray") +
theme_bw()
As observed, regardless of the choice of \(K\) in our KNN regression, the MSE obtained from KNN is consistently much smaller than that from OLS.
Curse of Dimensionality: When dealing with high-dimensional data (i.e., when you have a lot of predictors), KNN regression may suffer from the “curse of dimensionality.” In such cases, the performance of KNN deteriorates significantly as the dimensionality of the data increases. As the number of dimensions grows, the distance between data points becomes less meaningful: in high-dimensional spaces the concept of distance becomes less discriminating, as most data points are located far apart from each other. Consequently, KNN may struggle to find relevant neighbors, leading to poor predictive performance and increased computational complexity. On the other hand, linear regression, with its fewer parameters, is less affected by this issue.
Even in problems in which the dimension is small, we might prefer linear regression to KNN from an interpretability standpoint. If the test MSE of KNN is only slightly lower than that of linear regression, we might be willing to forego a little bit of prediction accuracy for the sake of a simple model that can be described in terms of just a few coefficients, and for which p-values are available.
Below is a population regression line equation, assuming that we are the god and know the data-generating process of the relationship between \(X\) (independent/explanatory variable) and \(Y\) (dependent/response/outcome variable).
However, since we usually don’t have the population data but only have access to a sample of it, we can only use the sample dataset to estimate the population regression line. In the below figure, \(b_0\) is the estimate of the regression intercept \(\beta_0\), while \(b_1\) is the estimate of the regression coefficient/slope \(\beta_1\).
The Ordinary Least Squares (OLS) approach aims to fit a line that minimizes the sum of the squared residuals, which means that the goal is to find a pair of \(b_0\) and \(b_1\) that can minimize the difference between our observed \(Y_i\) and estimated \(\hat{Y_i}\).
\[\min_{A, B} S(A, B) = \min \Sigma E_i^2 = \min \Sigma (Y_i-\hat{Y_i})^2=\min \Sigma (Y_i-(A+BX_i))^2\]
\[S(A, B) = \Sigma (Y_i-A-BX_i)^2\]
To find a pair of \(b_0\) and \(b_1\) that can minimize the difference between our observed \(Y_i\) and estimated \(\hat{Y_i}\) is an optimization problem. All we have to do is to take the partial derivatives of the above equation with respect to \(b_0\) and \(b_1\), respectively, set them equal to zero, and then solve it.
\[\frac{\partial S(A, B)}{\partial A}=\Sigma (-1)(2)(Y_i-A-BX_i)=0\]
\[\frac{\partial S(A, B)}{\partial B}=\Sigma (-X_i)(2)(Y_i-A-BX_i)=0\]
By solving the above two equations, you will get:
\[A = \bar{Y}-B\bar{X}\]
\[B = \frac{\Sigma (X_i-\bar{X})(Y_i-\bar{Y})}{\Sigma (X_i-\bar{X})^2}\]
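As a quick sanity check, these closed-form formulas can be computed directly in R and compared with the output of lm() (a sketch on simulated data; the variable names are my own):

```r
set.seed(2024)
x <- rnorm(200, mean = 5, sd = 2)
y <- 7 + 3 * x + rnorm(200)

# closed-form least-squares estimates
B <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
A <- mean(y) - B * mean(x)

# compare with lm(): the two sets of estimates should be identical
coef(lm(y ~ x))
c(A = A, B = B)
```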
Gauss Markov theorem states that “if the errors are independently distributed with zero expectation and constant variance, then the least-squares estimator b is the most efficient linear unbiased estimator of \(\beta\). That is, of all unbiased estimators that are linear functions of the observations, the least-squares estimator has the smallest sampling variance and, hence, the smallest mean-squared error. For this reason, the least-squares estimator is sometimes termed BLUE, an acronym for best linear unbiased estimator” (Fox 2016: 212).
In other words, under certain conditions, the ordinary least squares (OLS) estimates are thought to be the best linear unbiased estimator (BLUE). Here, I review a set of assumptions (known as Gauss-Markov assumptions) that leads to a BLUE estimator.
Linearity: the expected (mean) value of the disturbance term is 0 (i.e., \(E(u_i)=0\)).
Fox (2016) explains that a violation of this assumption implies “that the model fails to capture the systematic pattern of relationship between the response and explanatory variables” (307). This assumption can be broken in two ways: first, the mean of the error could be a nonzero constant across all values of \(X_i\); second, it could vary with \(X_i\). In the former case, the intercept term will be consistently off by the error amount; a uniform measurement error across all observations would be a cause of such an error (Berry 1993: 42). A common cause of the latter case is omitted variable bias, which deserves particular attention because it can cause the OLS estimate to be both biased and inconsistent. Another common cause is the wrong functional form (e.g., specifying a linear relation instead of a quadratic relation). Thus, when nonlinearity is detected, we should explore not only variable transformations and different model specifications (Fox, 2016: 317), but also the possibility of omitted variables. In short, before running OLS, researchers should make sure that the variables included in the model and the specified functional form are reasonable and grounded in theoretical understanding.
Nonstochastic regressors: \(X\) values are independent of the error term (i.e., \(cov(X_i, u_i)=0\)).
The violation of this assumption suggests that there is an endogeneity problem between the explanatory and outcome variables, which can arise from measurement error on \(X\), omitted confounders, or reciprocal causation between \(X\) and \(Y\). Again, if the main goal of the research is to provide an unbiased and consistent estimator to explain how a social phenomenon works in a real-life setting, a convincing explanation requires us to make sure that the effect of our explanatory variable is actually exogenous. Fortunately, scholars have developed many methods to address this issue. For example, researchers could employ an instrumental variable or matching to improve the unbiasedness of the estimated effect.
Homoskedasticity: constant error variance across values of \(X_i\) (i.e., \(Var(\varepsilon)=\sigma^2\)).
One instance where homoskedasticity is often violated is cross-sectional studies (Berry, 1993: 73). For example, because more developed countries can have better data accuracy, measurement error could be correlated with the level of development. Thus, in a study that includes the level of development as one of the independent variables, the variance of the error term may not be constant (Berry, 1993: 73). In such circumstances, Fox (2016) suggests that we can run weighted-least-squares (WLS) regression to overcome this problem (304). Other solutions include transformation of the dependent variable and correction of coefficient standard errors (e.g., robust standard errors and clustered standard errors) for heteroskedasticity (Fox, 2016: 307).
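To illustrate the standard-error correction mentioned above, here is a minimal sketch that hand-rolls White's heteroskedasticity-consistent (HC1) variance estimator in base R, which is what packages such as sandwich compute for you (the simulated data and variable names are my own):

```r
set.seed(99)
x <- runif(500, 1, 10)
y <- 2 + 0.5 * x + rnorm(500, sd = x)   # error variance grows with x
fit <- lm(y ~ x)

# White (HC0) robust variance, computed by hand:
X <- model.matrix(fit)
u <- residuals(fit)
meat  <- t(X) %*% (X * u^2)             # sum of u_i^2 * x_i x_i'
bread <- solve(t(X) %*% X)
vcov_hc0 <- bread %*% meat %*% bread
n <- nrow(X); k <- ncol(X)
vcov_hc1 <- vcov_hc0 * n / (n - k)      # HC1 small-sample correction

sqrt(diag(vcov_hc1))                    # robust standard errors
sqrt(diag(vcov(fit)))                   # classical standard errors, for comparison
```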
Independence: the observations are sampled independently. no autocorrelation between disturbances (i.e., \(cov(u_i, u_j)=0\) for \(i \neq j\)).
The independence assumption is important for time-series analysis, where the assumption of spherical errors is often violated. To overcome this, Beck & Katz (1995) propose panel-corrected standard errors for analyzing time-series cross-section data.
Normality: errors are distributed normally (i.e.,\(\varepsilon \sim N(0, \sigma^2)\)).
Assumptions 1 and 2 are related to the bias of the estimator, whereas assumptions 3 and 4 are related to its efficiency: when assumption 1 or 2 is violated, the OLS estimator is no longer unbiased; when assumption 3 or 4 is violated, the OLS estimator is no longer efficient. With this weak set of assumptions (1-4), according to the Gauss-Markov theorem, the OLS estimator is BLUE.
(In statistics, efficiency is a measure of the quality of an estimator; an efficient estimator has the smallest possible sampling variance.)
With the fifth assumption, normality, the OLS estimator is the most efficient among all the unbiased estimators, not just linear estimators. Adding this fifth assumption makes the Gauss-Markov theorem a strong set. This last assumption is important when we aim to make a causal inference using the OLS estimator. With the normal distribution of the errors assumption, we can also infer that the OLS estimators also have normal sampling distributions. As a result, it allows us to apply statistical tests such as t-tests even for small sample size studies.
One important caveat is that when there is perfect multicollinearity (i.e. when two or more independent variables are perfectly correlated), we cannot even get an estimate using the OLS method. However, even when there is high multicollinearity, the OLS estimate is still BLUE (Berry 1993: 27). In such cases, standard errors will be very high, making the estimates fluctuate considerably from sample to sample (Berry 1993: 27).
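The perfect-multicollinearity point is easy to see in R: lm() cannot separately estimate perfectly collinear predictors, so it drops one and reports NA for its coefficient (toy data of my own):

```r
set.seed(42)
x1 <- rnorm(100)
x2 <- 2 * x1                  # x2 is a perfect linear function of x1
y  <- 1 + x1 + rnorm(100)

coef(lm(y ~ x1 + x2))         # the coefficient on x2 is NA: it cannot be estimated
```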
The below table presents the consequences of violating the Gauss-Markov assumptions and the corresponding suggested solutions.
| Assumption | Violation | Solution |
|---|---|---|
| Linearity | Biased/inconsistent estimates | Transformations, polynomials, different model |
| Nonstochastic regressors | Biased/inconsistent estimates | Instrumental variables |
| Homoskedasticity | Biased standard errors | Robust standard errors |
| Independence | Biased standard errors | Fixed effects |
| Perfect collinearity | Cannot estimate OLS | Omit one collinear term |
The sample least squares coefficients are unbiased estimators of the population regression coefficients.
Here are some reviews before demonstrating the unbiasedness of the least-squares estimators \(A\) and \(B\) for \(\alpha\) and \(\beta\):
Demonstrate the unbiasedness of the least-squares estimators \(B\) for \(\beta\) in simple regression (i.e., \(E(B) = \beta\)).
\(E(B)=E(\frac{\Sigma(X_i-\bar{X})(Y_i-\bar{Y})}{\Sigma(X_i-\bar{X})^2})\)
\(=E(\frac{\Sigma(X_i-\bar{X})Y_i}{\Sigma(X_i-\bar{X})^2})\)
\(=\frac{1}{\Sigma(X_i-\bar{X})^2}E(\Sigma(X_i-\bar{X})Y_i)\)
\(=\frac{1}{\Sigma(X_i-\bar{X})^2}\Sigma(X_i-\bar{X})E(Y_i)\)
\(=\frac{1}{\Sigma(X_i-\bar{X})^2}\Sigma(X_i-\bar{X})E(\alpha + \beta X_i+\varepsilon_i)\)
\(=\frac{1}{\Sigma(X_i-\bar{X})^2}\Sigma(X_i-\bar{X})(\alpha +\beta X_i+E(\varepsilon_i))\)
\(=\frac{1}{\Sigma(X_i-\bar{X})^2}\Sigma(X_i-\bar{X})(\alpha+\beta X_i)\)
\(=\frac{1}{\Sigma(X_i-\bar{X})^2}(\Sigma(X_i-\bar{X})\alpha+\Sigma(X_i-\bar{X})\beta X_i)\)
\(=\frac{\beta}{\Sigma(X_i-\bar{X})^2}(\Sigma(X_i-\bar{X})X_i)\)
\(=\beta\)
Demonstrate the unbiasedness of the least-squares estimators \(A\) for \(\alpha\) in simple regression (i.e., \(E(A) = \alpha\)).
\(E(A)=E(\bar{Y}-B\bar{X})\)
\(=E(\frac{1}{n}\Sigma Y_i-\frac{\Sigma(X_i-\bar{X})(Y_i-\bar{Y})}{\Sigma(X_i-\bar{X})^2}\bar{X})\)
\(=E(\frac{1}{n}\Sigma Y_i-\frac{\Sigma(X_i-\bar{X})Y_i}{\Sigma(X_i-\bar{X})^2}\bar{X})\)
\(=E(\Sigma(\frac{1}{n}-\frac{\bar{X}(X_i-\bar{X})}{\Sigma(X_i-\bar{X})^2})Y_i)\)
\(=E(\Sigma(\frac{1}{n}-\frac{\bar{X}(X_i-\bar{X})}{\Sigma(X_i-\bar{X})^2})(\alpha +\beta X_i +\varepsilon_i))\)
\(=\Sigma(\frac{1}{n}-\frac{\bar{X}(X_i-\bar{X})}{\Sigma(X_i-\bar{X})^2})E(\alpha +\beta X_i +\varepsilon_i)\)
\(=\Sigma(\frac{1}{n}-\frac{\bar{X}(X_i-\bar{X})}{\Sigma(X_i-\bar{X})^2})(\alpha+\beta X_i+E(\varepsilon_i))\)
\(=\Sigma(\frac{1}{n}-\frac{\bar{X}(X_i-\bar{X})}{\Sigma(X_i-\bar{X})^2})(\alpha+\beta X_i)\)
\(=\Sigma \frac{\alpha}{n}+\Sigma \beta \frac{X_i}{n}-\alpha \frac{\bar{X}\Sigma(X_i-\bar{X})}{\Sigma(X_i-\bar{X})^2}-\beta \frac{\bar{X}\Sigma(X_i-\bar{X})X_i}{\Sigma(X_i-\bar{X})^2}\)
\(=\alpha + \beta \bar{X}-\alpha*0-\beta \bar{X}\)
\(=\alpha\)
Derive the sampling variance of \(B\) in a simple regression.
\(Var(B)=Var(\frac{\Sigma(X_i-\bar{X})Y_i}{\Sigma(X_i-\bar{X})^2})\)
\(=\frac{1}{(\Sigma(X_i-\bar{X})^2)^2}Var(\Sigma(X_i-\bar{X})Y_i)\)
\(=\frac{1}{(\Sigma(X_i-\bar{X})^2)^2}Var(\Sigma(X_i-\bar{X})(\alpha + \beta X_i+\varepsilon_i))\)
\(=\frac{1}{(\Sigma(X_i-\bar{X})^2)^2}Var(\Sigma(X_i-\bar{X})(\alpha + \beta X_i)+\Sigma(X_i-\bar{X})\varepsilon_i)\)
\(=\frac{1}{(\Sigma(X_i-\bar{X})^2)^2}Var(\Sigma(X_i-\bar{X})\varepsilon_i)=\frac{1}{(\Sigma(X_i-\bar{X})^2)^2}\Sigma Var((X_i-\bar{X})\varepsilon_i)\)
\(=\frac{1}{(\Sigma(X_i-\bar{X})^2)^2}(\Sigma (X_i-\bar{X})^2 Var(\varepsilon_i))\)
\(=\frac{1}{(\Sigma(X_i-\bar{X})^2)^2}(\Sigma (X_i-\bar{X})^2 \sigma_\varepsilon^2)\)
\(=\frac{\sigma_\varepsilon^2}{\Sigma(X_i-\bar{X})^2}\)
Derive the sampling variance of \(A\) in a simple regression.
\(Var(A)=Var(\bar{Y}-B\bar{X})\)
\(=Var(\bar{Y})+Var(B\bar{X})-2Cov(\bar{Y},B\bar{X})\)
\(=Var(\bar{Y})+Var(B\bar{X})\)
\(=Var(\frac{\Sigma Y_i}{n})+\bar{X}^2Var(B)\)
\(=\frac{1}{n^2}Var(\Sigma Y_i)+\bar{X}^2Var(B)\)
\(=\frac{1}{n^2}\Sigma Var(Y_i)+\bar{X}^2Var(B)\)
\(=\frac{1}{n^2}\Sigma Var(\alpha+\beta X_i+\varepsilon_i)+\bar{X}^2Var(B)\)
\(=\frac{1}{n^2}\Sigma Var(\varepsilon_i)+\bar{X}^2Var(B)\)
\(=\frac{1}{n^2}n\sigma_\varepsilon^2+\bar{X}^2(\frac{\sigma_\varepsilon^2}{\Sigma(X_i-\bar{X})^2})\)
\(=\sigma_\varepsilon^2(\frac{1}{n}+\frac{\bar{X}^2}{\Sigma(X_i-\bar{X})^2})\)
\(=\sigma_\varepsilon^2(\frac{\Sigma(X_i-\bar{X})^2 +n\bar{X}^2}{n\Sigma(X_i-\bar{X})^2})\)
\(=\sigma_\varepsilon^2(\frac{\Sigma X_i^2-2\bar{X}\Sigma X_i+\Sigma \bar{X}^2+n\bar{X}^2}{n\Sigma(X_i-\bar{X})^2})\)
\(=\sigma_\varepsilon^2(\frac{\Sigma X_i^2-2\bar{X}\Sigma X_i+2n \bar{X}^2}{n\Sigma(X_i-\bar{X})^2})\)
\(=\sigma_\varepsilon^2(\frac{\Sigma X_i^2-2\frac{\Sigma X_i}{n}\Sigma X_i+2n \bar{X}^2}{n\Sigma(X_i-\bar{X})^2})\)
\(=\sigma_\varepsilon^2(\frac{\Sigma X_i^2-\frac{2(\Sigma X_i)^2}{n}+2n\frac{(\Sigma X_i)^2}{n^2}}{n\Sigma(X_i-\bar{X})^2})\)
\(=\frac{\sigma_\varepsilon^2 \Sigma X_i^2}{n\Sigma(X_i-\bar{X})^2}\)
As mentioned earlier, the fifth assumption, normality, allows us to apply statistical inference tests to assess the \(\beta\)s we estimate. Recall from the lecture that the residual standard error \(S_E\) (how closely the regression line we estimate fits the scatter of our data points) is:
\[\hat{\sigma } = SE(E_i) = \sqrt{\frac{\Sigma (E_i)^2}{n-2}} = \sqrt{\frac{\Sigma (Y_i-\hat{Y_i})^2}{n-2}}\]
And the standard error of the sample intercept (\(A\)) and slope (\(B\)) are:
\[SE(A) = \hat{\sigma }\sqrt{\frac{\Sigma X_i^2}{n\Sigma (X_i-\bar{X})^2}}\]
\[SE(B) = \frac{\hat{\sigma }}{\sqrt{\Sigma (X_i-\bar{X})^2}}\]
With standard errors, we can construct a \(100(1-\alpha)\%\) confidence interval for our slope and perform hypothesis testing.
Confidence Interval
\(\beta=B \pm t_{\frac{\alpha}{2}, df=n-2}SE(B)\)
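In R, this interval can be computed by hand and checked against confint() (a sketch on simulated data; the variable names are my own):

```r
set.seed(7)
x <- rnorm(50)
y <- 2 + 3 * x + rnorm(50)
fit <- lm(y ~ x)

B    <- unname(coef(fit)["x"])
se_B <- summary(fit)$coefficients["x", "Std. Error"]
t_c  <- qt(1 - 0.05/2, df = nrow(model.frame(fit)) - 2)  # critical value for alpha = .05

c(lower = B - t_c * se_B, upper = B + t_c * se_B)  # manual 95% CI
confint(fit, "x", level = 0.95)                    # matches the manual interval
```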
Hypothesis Test
Two-tailed test (by default, the lm() function conducts a two-tailed test)
\(H_0: \beta = 0\)
\(H_1: \beta \neq 0\)
\(t=\frac{B-\beta}{SE(B)}=\frac{B-0}{SE(B)}\)
Compare this t-statistic with the critical value \(t_{\frac{\alpha}{2}, df=n-2}\); if \(|t|\) exceeds the critical value, we can reject \(H_0\) in favor of \(H_1\). In addition to the t-statistic, we can also calculate the p-value and compare it against \(\alpha\) (p-value \(= 2 \cdot Pr(T \geq |t|)\)).
One-tailed test (if your hypotheses are directional, you can also conduct one-tailed test)
\(H_0: \beta = 0\)
\(H_1: \beta < 0\) or \(H_1: \beta > 0\)
\(t=\frac{B-\beta_1}{SE(B)}=\frac{B-0}{SE(B)}\)
When performing a one-tailed test, compare the t-statistic with \(t_{\alpha, df=n-2}\) without dividing \(\alpha\) by 2. P-value in one-tailed test is calculated as:
p-value \(= Pr(t \geq t_c)\) or \(Pr(t \leq t_c)\), depending on the direction of \(H_1\), vs \(\alpha\)
Let’s run a simple example using the prestige data from Problem Set 1. The prestige data consist of 10 observations on 2 variables. The descriptions of the variables are as follows:
prestige: the average prestige rating for the occupation.
education: the average number of years of education for occupational incumbents.
# load data
df <- read.csv("/Users/yu-shiuanhuang/Desktop/method-sequence/data/ps1_prestige.csv")
df
## prestige education
## 1 82 86
## 2 83 76
## 3 90 92
## 4 76 90
## 5 90 86
## 6 87 84
## 7 93 93
## 8 90 100
## 9 52 87
## 10 88 86
Let’s first calculate all the statistics we’ll need to perform hypothesis testing.
Slope: \(B\)
\(B=\frac{\Sigma(X_i-\bar{X})(Y_i-\bar{Y})}{\Sigma(X_i-\bar{X})^2}\)
# calculate the slope b1
B <- sum((df$education-mean(df$education))*(df$prestige-mean(df$prestige)))/
sum((df$education-mean(df$education))^2)
B
## [1] 0.3895028
Intercept: \(A\)
\(A=\bar{Y}-B\bar{X}\)
# calculate the intercept b0
A <- mean(df$prestige)-B*mean(df$education)
A
## [1] 48.82376
Residual Standard Error: \(\hat{\sigma} = SE(E_i)\)
\(\hat{\sigma } = SE(E_i) = \sqrt{\frac{\Sigma (E_i)^2}{n-2}} = \sqrt{\frac{\Sigma (Y_i-\hat{Y_i})^2}{n-2}}\)
# generate predicted Y
df$pred <- A + B*df$education
#calculate the residual standard error
se <- sqrt(sum((df$prestige-df$pred)^2)/(nrow(df)-2))
se
## [1] 12.46986
Standard Error of the Estimated Intercept and Slope: \(SE(A)\) & \(SE(B)\)
\(SE(A) = \hat{\sigma }\sqrt{\frac{\Sigma X_i^2}{n\Sigma (X_i-\bar{X})^2}}\)
\(SE(B) = \frac{\hat{\sigma }}{\sqrt{\Sigma (X_i-\bar{X})^2}}\)
# calculate the standard error of the estimated intercept b0
se_A <- se*sqrt(sum(df$education^2)/(nrow(df)*sum((df$education-mean(df$education))^2)))
se_A
## [1] 57.80998
# calculate the standard error of the estimated slope b1
se_B <- se/sqrt(sum((df$education-mean(df$education))^2))
se_B
## [1] 0.6554015
Now, let’s perform two-tailed hypothesis testing of the slope.
\(H_0: \beta = 0\)
\(H_1: \beta \neq 0\)
\(t=\frac{B-\beta_1}{SE(B)}=\frac{B-0}{SE(B)}=\frac{0.3895028-0}{0.6554015}=0.5942964\)
# calculate the t-statistic
t <- B/se_B
t
## [1] 0.5942964
What are the t critical values in both tails? Let’s say the level of significance, \(\alpha\), is 0.05 in this case.
\(|t_{\frac{\alpha}{2}, df=n-2}|=|t_{\frac{0.05}{2}, df=10-2}|=2.306004\)
# find t critical value
qt(p = 0.025, df = 8, lower.tail = TRUE)
## [1] -2.306004
qt(p = 0.025, df = 8, lower.tail = FALSE)
## [1] 2.306004
Now, let’s compare our t-statistic of \(b_1\) with the t critical value.
\(t=0.5942964 < |t_{\frac{0.05}{2}, df=8}|=2.306004\)
We can also calculate the p-value.
p-value \(= 2*Pr(t \geq 0.5942964)=0.5687352 > 0.05\)
# find p-value
2*pt(q = 0.5942964, df = 8, lower.tail = FALSE)
## [1] 0.5687352
These all suggest that we cannot reject \(H_0\), which means that the positive effect of education on one’s prestige is not statistically significant.
Let’s also try performing a one-tailed hypothesis test for this case.
\(H_0: \beta = 0\)
\(H_1: \beta > 0\)
\(t=\frac{B-\beta_1}{SE(B)}=\frac{B-0}{SE(B)}=\frac{0.3895028-0}{0.6554015}=0.5942964\)
What is the t critical value in the right tail?
\(t_{\alpha, df=n-2}=t_{0.05, df=10-2}=1.859548\)
# find t critical value
qt(p = 0.05, df = 8, lower.tail = FALSE)
## [1] 1.859548
Now again, let’s compare our t-statistic of \(B\) with the t critical value.
\(t=0.5942964 < t_{0.05, df=8}=1.859548\)
And the p-value in the one-tailed test is:
p-value \(=Pr(t \geq 0.5942964)= 0.2843676 > 0.05\)
# find p-value
pt(q = 0.5942964, df = 8, lower.tail = FALSE)
## [1] 0.2843676
In R, lm() does all the above calculation for us!
# simple regression
fit <- lm(prestige ~ education, data = df)
summary(fit)
##
## Call:
## lm(formula = prestige ~ education, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -30.7105 0.3157 4.9580 5.6238 7.9525
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 48.8238 57.8100 0.845 0.423
## education 0.3895 0.6554 0.594 0.569
##
## Residual standard error: 12.47 on 8 degrees of freedom
## Multiple R-squared: 0.04228, Adjusted R-squared: -0.07743
## F-statistic: 0.3532 on 1 and 8 DF,  p-value: 0.5687
We understand that there will be variability in the variable
prestige across different occupations, with some having
high prestige and others having low prestige. Some of this variability
can be attributed to factors not included as variables in our regression
model, such as economic conditions and other unmeasured variables.
Additionally, differences in the variable education may
explain some of the differences in prestige across
occupations. For instance, occupations with a higher average number of
years of education among incumbents tend to have higher prestige.
However, it’s important to quantify how much of the variability in
prestige can be explained by the variables included in our
regression model, and how much remains unexplained by other factors that
we have not accounted for.
Explained Variability: Regression Sum of Squares (RegSS)
The Regression Sum of Squares (RegSS) (sometimes referred to as the Sum of Squares Explained, SSE) is a measure of the variability in the outcome variable that is explained by the explanatory variables, i.e. the x-variables in your regression. It is given by the following sum:
\(RegSS = \sum_{i=1}^{n}(\hat{y_i}-\bar{y})^2\)
Residual or Unexplained Variability: Residual Sum of Squares (RSS)
The Residual Sum of Squares (RSS) is a measure of the variability in
the outcome variable that is not explained by your regression, and
therefore is due to all the other factors that affect
prestige besides education. It is given by the
following sum:
\(RSS = \sum_{i=1}^{n}(y_i-\hat{y_i})^2 = \sum_{i=1}^{n}e_i^2\)
Total Variability: Total Sum of Squares (TSS)
You can show mathematically that \(RSS + RegSS\) is equal to the following expression, which is referred to as the Total Sum of Squares (TSS):
\(TSS = \sum_{i=1}^{n}(y_i-\bar{y})^2 = \sum_{i=1}^{n}(y_i-\hat{y_i})^2 + \sum_{i=1}^{n}(\hat{y_i}-\bar{y})^2 = RSS + RegSS\)
Coefficient of Determination: R-Squared value
The coefficient of determination, sometimes referred to as the R-Squared value, is a measure of what percentage of the variability in your outcome variable is explained by your explanatory variables. It is given by the expression,
\(R^2 = \frac{RegSS}{TSS}\)
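Using the prestige example above, here is a sketch of computing these sums of squares by hand (it assumes the objects `df`, `A`, and `B` from the earlier chunks):

```r
# assumes df, A, and B from the prestige example above
df$pred <- A + B*df$education                    # fitted values
RSS   <- sum((df$prestige - df$pred)^2)          # unexplained variability
RegSS <- sum((df$pred - mean(df$prestige))^2)    # explained variability
TSS   <- sum((df$prestige - mean(df$prestige))^2)
all.equal(TSS, RSS + RegSS)   # the decomposition TSS = RSS + RegSS holds
RegSS / TSS                   # R-squared; compare with Multiple R-squared
                              # in summary(fit)
```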
The central difference between simple and multiple linear regression is that the slope coefficients for the explanatory variables in multiple regression are partial coefficients. That is, each represents the “effect” on the response variable of a one-unit increment in the corresponding explanatory variable, holding constant the values of the other explanatory variables.
\[Y_i=\hat{\beta_0}+\hat{\beta_1}X_{i1}+\hat{\beta_2}X_{i2}+...+\hat{\beta_k}X_{ik}+\varepsilon_i\]
Recall that the goal of the OLS approach is to minimize the discrepancy between our observed \(Y_i\) and estimated \(\hat{Y_i}\). To find a set of estimated \(\beta\)s that can minimize this discrepancy is an optimization problem. All we have to do is to take the partial derivatives of the above equation with respect to each coefficient, set them equal to zero, and then solve it. However, as shown in the lecture, adding more and more \(X\) variables to our linear regression model makes the calculation more and more tedious. In order to find the estimated \(\beta\)s more efficiently, we use matrix algebra to help us do the calculation.
To do this, let’s first think of our data set as a matrix. The equations below show how each observation in our data set is generated.
\[Y_1=\hat{\beta_0}+\hat{\beta_1}X_{11}+\hat{\beta_2}X_{12}+...+\hat{\beta_k}X_{1k}+\varepsilon_1\]
\[Y_2=\hat{\beta_0}+\hat{\beta_1}X_{21}+\hat{\beta_2}X_{22}+...+\hat{\beta_k}X_{2k}+\varepsilon_2\]
.
.
.
\[Y_n=\hat{\beta_0}+\hat{\beta_1}X_{n1}+\hat{\beta_2}X_{n2}+...+\hat{\beta_k}X_{nk}+\varepsilon_n\]
Rewrite them into a matrix form.
The system of equations is summarized as the y vector equal to the product of the X matrix of data times the \(\hat{\beta}\) vector of parameters, plus the vector of disturbances, \(\varepsilon\). That is, the fitted linear model we are trying to find is \(y = X\hat{\beta}+\varepsilon\). In the lecture, Lauren showed how to apply what we did when finding the intercept and slope in simple linear regression to solve the optimization problem (i.e., minimizing RSS) and find the coefficient vector \(\hat{\beta}\) using matrix algebra.
\[ \begin{aligned} S(\hat{\beta}) &= \sum E_i^2 = \varepsilon'\varepsilon = (y- X\hat{\beta})'(y- X\hat{\beta})\\ &= y'y-y'X\hat{\beta}-\hat{\beta}'X'y+\hat{\beta}'X'X\hat{\beta} \\ &= y'y-(2y'X)\hat{\beta}+\hat{\beta}'(X'X)\hat{\beta} \end{aligned} \]
Instead of taking derivative with respect to each coefficient one at a time, we can take derivative directly with respect to the coefficient vector \(\hat{\beta}\).
\[ \begin{aligned} \frac{\partial S(\hat{\beta})}{\partial \hat{\beta}} &= \frac{\partial}{\partial \hat{\beta}}\left(y'y-(2y'X)\hat{\beta}+\hat{\beta}'(X'X)\hat{\beta}\right) \\ 0 &= -2X'y+ 2(X'X)\hat{\beta} \\ 2(X'X)\hat{\beta} &= 2X'y \\ \hat{\beta} &= (X'X)^{-1}X'y \end{aligned} \] Therefore, the least squares estimator is:
\[\hat{\beta}=(X'X)^{-1}X'y\]
Let’s use the Anscombe.txt data set as an example to
find the estimated coefficients by using the above matrix equation.
education = per capita education expenditures, dollars
income = per capita income, dollars
under18 = proportion under 18 years old, per 1000
urban = proportion urban, per 1000
# load data
df2 <- read.delim("/Users/yu-shiuanhuang/Desktop/method-sequence/data/Anscombe.txt", sep = "")
head(df2)
## education income under18 urban
## ME 189 2824 350.7 508
## NH 169 3259 345.9 564
## VT 230 3072 348.5 322
## MA 168 3835 335.3 846
## RI 180 3549 327.1 871
## CT 193 4256 341.0 774
# convert each variable to a vector
edu <- df2$education
intercept <- rep(1, nrow(df2)) # remember to create this vector of 1s!
inc <- df2$income
un18 <- df2$under18
urb <- df2$urban
X <- cbind(intercept, inc, un18, urb)
class(X) # this is a matrix
## [1] "matrix" "array"
We now have a vector edu and a matrix \(X\) containing our intercept column and our three explanatory variables (income, under18, urban). Let’s estimate our best-fit parameters \(\beta_0\), \(\beta_1\), \(\beta_2\), and \(\beta_3\) by applying the estimator we derived in the lecture: \(\hat{\beta} = (X'X)^{-1}X'y\).
beta.hat <- solve(t(X) %*% X) %*% t(X) %*% edu
t(beta.hat)
## intercept inc un18 urb
## [1,] -286.8388 0.08065325 0.8173377 -0.1058062
Compare with the canned routine in R to confirm the matrix calculation was correct.
mod <- lm(edu ~ X-1) # the -1 removes lm()'s automatic intercept, since X already contains the column of 1s we created for the matrix algebra
beta.canned <- mod$coefficients
beta.canned
## Xintercept Xinc Xun18 Xurb
## -286.83876273 0.08065325 0.81733774 -0.10580623
How do we come up with the predicted \(\hat{y}\)? \[\hat{y}=X\hat{\beta}\]
eduhat <- beta.hat[1] + df2$income*beta.hat[2] +
df2$under18*beta.hat[3] + df2$urban*beta.hat[4]
eduhat
## [1] 173.8168 199.0526 211.7006 207.0077 174.5936 253.2396 223.9983 210.5846
## [9] 179.8788 206.2476 213.3513 231.5918 233.2380 209.4856 211.0236 196.9737
## [17] 176.3859 188.1518 199.2828 195.3128 187.4341 250.0855 231.5108 252.0303
## [25] 182.3619 139.8510 169.8280 162.6433 176.5621 159.9772 156.6485 139.1353
## [33] 135.8968 148.7554 135.1561 174.0988 143.0524 175.0569 195.4585 171.6380
## [41] 205.2510 192.1739 197.6281 189.7953 190.1857 261.4657 212.7525 181.6204
## [49] 221.7760 355.7228 221.5298
How do we come up with the residuals? \[e=y-\hat{y}=y-X\hat{\beta}\]
e <- edu - eduhat
e
## [1] 15.1831901 -30.0526055 18.2993676 -39.0077435 5.4064124 -60.2396370
## [7] 37.0016640 3.4153951 21.1211678 -34.2476465 -19.3513498 -42.5918115
## [13] -0.2380450 -0.4855509 50.9763644 37.0263340 0.6140641 -11.1518234
## [19] -12.2828447 -47.3127738 8.5658819 -2.0854980 15.4891839 -6.0302784
## [25] -2.3619147 9.1490037 -14.8279988 -13.6433457 -20.5621494 31.0227604
## [31] -16.6485112 -2.1352962 -23.8967859 -18.7553867 -1.1561315 -12.0987770
## [37] -8.0523620 -20.0569429 42.5415211 -1.6380384 32.7489708 -0.1738631
## [43] 29.3718741 17.2047442 10.8143055 -36.4656906 2.2475110 51.3796304
## [49] 51.2240417 16.2771794 -9.5297657
Now calculate the standard errors of the vector \(\hat{\beta}\). Recall the error variance is unknown but we have available the unbiased estimator \(S^2_E = (e'e)/(n-k-1)\), based on the residuals of our model fit. We can extract these residuals, calculate the estimator, and then use it to calculate the variance of the least squares coefficients \(\hat{V(b)} = S^2_E (X'X)^{-1}\).
As a quick refresher of concepts: the variance is a measure of a random variable’s “spread” or variation around its mean (a.k.a. its expected value), while the co-variance measures how correlated are the variations of two random variables with each other.
The variance-covariance matrix is a square matrix (i.e. it has the same number of rows and columns). The elements that lie along its main diagonal (the one that goes from top-left to bottom-right) contain the variances, while all other elements contain the covariances. Thus, the variance-covariance matrix of the fitted coefficients of a regression model contains the variances of the coefficient estimates along its main diagonal, and the pair-wise covariances between coefficient estimates in the off-diagonal elements.
Why do we calculate not only the variances of the regression coefficients but also their covariances?
The covariance matrix can be very useful for certain model diagnostics. If two variables are highly correlated, one way to think about it is that the model is having trouble figuring out which variable is responsible for an effect (because they are so closely related). This can be helpful for a whole variety of cases, such as choosing subsets of covariates to use in a predictive model; if two variables are highly correlated, you may only want to use one of the two in your predictive model.
S2E <- as.numeric((t(e)%*%e)/(nrow(df2)-3-1))
S2E
## [1] 712.5394
v.beta.hat <- S2E*solve(t(X)%*%X)
v.beta.hat # the diagonal numbers are the variance of each coefficient
## intercept inc un18 urb
## intercept 4214.5975090 -1.849526e-01 -9.7493677311 -0.1582902773
## inc -0.1849526 8.646281e-05 0.0001387409 -0.0002162611
## un18 -9.7493677 1.387409e-04 0.0255327479 0.0002084764
## urb -0.1582903 -2.162611e-04 0.0002084764 0.0011752674
sqrt(diag(v.beta.hat))
## intercept inc un18 urb
## 64.919931523 0.009298538 0.159789699 0.034282173
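To see how correlated the coefficient estimates are with one another (the diagnostic use discussed above), one option is to rescale the variance-covariance matrix into a correlation matrix with R's cov2cor() (a sketch, assuming v.beta.hat from the chunk above):

```r
# rescale the variance-covariance matrix of the coefficients into a
# correlation matrix: diagonal becomes 1, off-diagonal entries are the
# pairwise correlations between coefficient estimates
cov2cor(v.beta.hat)
```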
Compare with the canned routine in R to see if we got the same standard errors of coefficients.
| OLS | | | |
|---|---|---|---|
| Variables | Estimates | S.E. | C.I. (95%) |
| Xintercept | -286.84 *** | 64.92 | -417.44 – -156.24 |
| Xinc | 0.08 *** | 0.01 | 0.06 – 0.10 |
| Xun18 | 0.82 *** | 0.16 | 0.50 – 1.14 |
| Xurb | -0.11 ** | 0.03 | -0.17 – -0.04 |
| Observations | 51 | | |
| R2 / R2 adjusted | 0.984 / 0.982 | | |
With the standard errors of each coefficient, we can perform hypothesis testing and construct confidence intervals to determine the significance and reliability of the estimated effects (similar to what we did in the case of simple linear regression).
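As a sketch of that testing step, the t-statistics and two-tailed p-values for each coefficient can be computed directly from the matrix results (assuming beta.hat and v.beta.hat from the chunks above):

```r
# assumes beta.hat and v.beta.hat from the chunks above
se.beta.hat <- sqrt(diag(v.beta.hat))          # standard errors
t.stat <- as.vector(beta.hat) / se.beta.hat    # t = B / SE(B) under H0: beta = 0
p.val <- 2 * pt(abs(t.stat), df = nrow(df2) - 3 - 1, lower.tail = FALSE)
cbind(estimate = as.vector(beta.hat), se = se.beta.hat,
      t = t.stat, p = p.val)
# compare with the coefficient table from summary(lm(...)) above
```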
What would happen when there is perfect multicollinearity in our data (i.e. when two or more independent variables are perfectly correlated)?
# convert each variable to a vector
edu <- df2$education
intercept <- rep(1, nrow(df2)) # remember to create this vector of 1s!
inc <- df2$income
un18 <- df2$under18
urb <- df2$urban
pmc <- 2*inc
cor(pmc, inc) # pmc is perfectly correlated with inc
## [1] 1
X <- cbind(intercept, inc, un18, urb, pmc)
class(X) # this is a matrix
## [1] "matrix" "array"
beta.hat <- solve(t(X) %*% X) %*% t(X) %*% edu
## Error in solve.default(t(X) %*% X): Lapack routine dgesv: system is exactly singular: U[5,5] = 0
As you can see in the result of the above chunk, the matrix algebra fails when there is perfect multicollinearity: \(X'X\) is singular, so it cannot be inverted. However, when you run lm() in R, R will automatically drop the variable that is perfectly correlated with the other variable(s) in the dataset.
mod2 <- lm(edu ~ X-1) # the -1 removes lm()'s automatic intercept, since X already contains the column of 1s we created for the matrix algebra
| OLS | | | |
|---|---|---|---|
| Variables | Estimates | S.E. | C.I. (95%) |
| Xintercept | -286.84 *** | 64.92 | -417.44 – -156.24 |
| Xinc | 0.08 *** | 0.01 | 0.06 – 0.10 |
| Xun18 | 0.82 *** | 0.16 | 0.50 – 1.14 |
| Xurb | -0.11 ** | 0.03 | -0.17 – -0.04 |
| Observations | 51 | | |
| R2 / R2 adjusted | 0.984 / 0.982 | | |
If there is high multicollinearity between variables, but they are not perfectly correlated, the matrix algebra still works, and lm() will still spit out estimates for the highly correlated variables. However, the standard errors of the coefficients will be very high.
# convert each variable to a vector
edu <- df2$education
intercept <- rep(1, nrow(df2)) # remember to create this vector of 1s!
inc <- df2$income
un18 <- df2$under18
urb <- df2$urban
set.seed(1234)
pmc <- 2*inc + rnorm(nrow(df2), 5, 1)
cor(pmc, inc) # pmc is highly but not perfectly correlated with inc
## [1] 0.9999997
X <- cbind(intercept, inc, un18, urb, pmc)
class(X) # this is a matrix
## [1] "matrix" "array"
beta.hat <- solve(t(X) %*% X) %*% t(X) %*% edu
t(beta.hat)
## intercept inc un18 urb pmc
## [1,] -290.6956 -1.4651 0.8188972 -0.1038287 0.7726427
eduhat <- beta.hat[1] + df2$income*beta.hat[2] +
df2$under18*beta.hat[3] + df2$urban*beta.hat[4] + pmc*beta.hat[5]
e <- edu - eduhat
S2E <- as.numeric((t(e)%*%e)/(nrow(df2)-4-1))
v.beta.hat <- S2E*solve(t(X)%*%X)
sqrt(diag(v.beta.hat))
## intercept inc un18 urb pmc
## 69.35278085 9.01847688 0.16172179 0.03651251 4.50787061
| OLS | | | | | | |
|---|---|---|---|---|---|---|
| Variables | Estimates | S.E. | C.I. (95%) | Estimates | S.E. | C.I. (95%) |
| Xintercept | -286.84 *** | 64.92 | -417.44 – -156.24 | -290.70 *** | 69.35 | -430.30 – -151.10 |
| Xinc | 0.08 *** | 0.01 | 0.06 – 0.10 | -1.47 | 9.02 | -19.62 – 16.69 |
| Xun18 | 0.82 *** | 0.16 | 0.50 – 1.14 | 0.82 *** | 0.16 | 0.49 – 1.14 |
| Xurb | -0.11 ** | 0.03 | -0.17 – -0.04 | -0.10 ** | 0.04 | -0.18 – -0.03 |
| Xpmc | | | | 0.77 | 4.51 | -8.30 – 9.85 |
| Observations | 51 | | | 51 | | |
| R2 / R2 adjusted | 0.984 / 0.982 | | | 0.984 / 0.982 | | |